-
Notifications
You must be signed in to change notification settings - Fork 111
feat: support writing domain metadata (2/2) #1275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1275 +/- ##
==========================================
+ Coverage 84.55% 84.89% +0.33%
==========================================
Files 113 113
Lines 28009 28830 +821
Branches 28009 28830 +821
==========================================
+ Hits 23683 24475 +792
- Misses 3189 3197 +8
- Partials 1137 1158 +21 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
fd368f5
to
dece4cd
Compare
a568124
to
2c50fd0
Compare
2c50fd0
to
f7319a6
Compare
f7319a6
to
05ae907
Compare
## What changes are proposed in this pull request? <!-- Please clarify what changes you are proposing and why the changes are needed. The purpose of this section is to outline the changes, why they are needed, and how this PR fixes the issue. If the reason for the change is already explained clearly in an issue, then it does not need to be restated here. 1. If you propose a new API or feature, clarify the use case for a new API or feature. 2. If you fix a bug, you can clarify why it is a bug. --> This PR is (1/2) to support writing domain metadata. 1. (this PR) support adding or updating a domain metadata configuration 2. (follow up) support _removing_ a domain metadata, #1275. - resolves #1270 Writing domain metadata, as per the [Delta protocol](https://arc.net/l/quote/fwjwtumy) requires: - The table must be on writer version 7, and a feature named `domainMetadata` must exist in the table's `writerFeatures`. We satisfy this via an explicit check in `Transaction::commit`. - Two overlapping transactions conflict if they both contain a domain metadata action for the same metadata domain. We satisfy this with Kernel's coarse grained conflict detection based on the fact that the `delta_log` dir already has a log file with the name this txn was going to write. No new logic needed in this PR. ## How was this change tested? <!-- Please make sure to add test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested, ideally via a reproducible test documented in the PR description. --> Added new integration tests in `write.rs`.
f4bfcd5
to
ef7ca8a
Compare
ef7ca8a
to
7897691
Compare
kernel/src/actions/mod.rs
Outdated
pub(crate) fn remove(domain: String) -> Self { | ||
Self { | ||
domain, | ||
configuration: String::new(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this how existing implementations do removes? I kinda figured it would retain the old configuration but haven't dug into this yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great call out! i went back and looked at the protocol again. "Writers should preserve an accurate pre-image of the configuration" -> so we should keep it around actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retaining the old configuration would require log replay though, right? We could accept configuration: Option<String>
as an additional parameter to let the caller provide the old value if already known, but TBH I'm not sure if that improves the current solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that it requires log replay. But as long as we do that once it should be okay. I checked the Java version of the delta kernel for inspiration and they follow this pattern where remove takes only the domain
and they internally determine the pre-image configuration value to preserve. Code ptr: https://github.com/delta-io/delta/blob/master/kernel/kernel-api/src/main/java/io/delta/kernel/internal/TransactionImpl.java#L230C15-L230C35.
The other thing is if we let users pass it in, we likely should validate that the pre-image they provide is correct. Which would require log replay anyways.
Co-authored-by: Zach Schuermann <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. I left a few nits and one suggestion for improving test coverage.
kernel/src/actions/mod.rs
Outdated
pub(crate) fn remove(domain: String) -> Self { | ||
Self { | ||
domain, | ||
configuration: String::new(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retaining the old configuration would require log replay though, right? We could accept configuration: Option<String>
as an additional parameter to let the caller provide the old value if already known, but TBH I'm not sure if that improves the current solution.
kernel/src/transaction/mod.rs
Outdated
} | ||
|
||
/// Remove domain metadata from the Delta log. | ||
/// This creates a tombstone to logically delete the specified domain. We don't check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: IMO it is good style to have a paragraph break (i.e., empty line) after the one-line summary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I'm not following what you mean. Do you mean the "We don't check" should start on the next line perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant
/// Remove domain metadata from the Delta log.
///
/// This creates a tombstone to logically delete the specified domain. We don't check
See this clippy lint for reference.
kernel/src/actions/mod.rs
Outdated
} | ||
|
||
// Create a new DomainMetadata action to remove a domain. | ||
pub(crate) fn remove(domain: String) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like remove
is a somewhat misleading name for a method that is actually a constructor. I don't have a strong opinion here, but maybe something like new_tombstone
communicates the function's intent more clearly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice tests! Just some code reshuffling then we're good to go 👍
let user_domains = self | ||
.domain_metadatas | ||
.iter() | ||
.filter_map(move |dm: &DomainMetadata| { | ||
if dm.is_removed() { | ||
existing_domains.get(dm.domain()).map(|existing| { | ||
DomainMetadata::remove( | ||
dm.domain().to_string(), | ||
existing.configuration().to_string(), | ||
) | ||
}) | ||
} else { | ||
Some(dm.clone()) | ||
} | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can consume the domain_metadatas and avoid the clone since it seems like we don't use it anymore
let user_domains = self | |
.domain_metadatas | |
.iter() | |
.filter_map(move |dm: &DomainMetadata| { | |
if dm.is_removed() { | |
existing_domains.get(dm.domain()).map(|existing| { | |
DomainMetadata::remove( | |
dm.domain().to_string(), | |
existing.configuration().to_string(), | |
) | |
}) | |
} else { | |
Some(dm.clone()) | |
} | |
}); | |
let user_domains = self | |
.domain_metadatas | |
.into_iter() | |
.filter_map(move |dm: DomainMetadata| { | |
if !dm.is_removed() { | |
return Some(dm); | |
} | |
existing_domains.get(dm.domain()).map(|existing| { | |
DomainMetadata::remove( | |
dm.domain().to_string(), | |
existing.configuration().to_string(), | |
) | |
}) | |
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to use removed_domains
this becomes:
We can consume the domain_metadatas and avoid the clone since it seems like we don't use it anymore
let user_domains = self | |
.domain_metadatas | |
.iter() | |
.filter_map(move |dm: &DomainMetadata| { | |
if dm.is_removed() { | |
existing_domains.get(dm.domain()).map(|existing| { | |
DomainMetadata::remove( | |
dm.domain().to_string(), | |
existing.configuration().to_string(), | |
) | |
}) | |
} else { | |
Some(dm.clone()) | |
} | |
}); | |
let user_domains = self | |
.removed_domains | |
.iter() | |
.map(move |domain_name: &str| { | |
existing_domains.get(domain_name).map(|existing| { | |
DomainMetadata::remove( | |
dm.domain().to_string(), | |
existing.configuration().to_string(), | |
) | |
}) | |
}) | |
.chain(domain_metadatas); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will probably apply suggestion 1 or 2 depending on our takeaways from the other threads :)
// fetch previous configuration values (requires log replay) | ||
let existing_domains = if has_removals { | ||
scan_domain_metadatas(self.read_snapshot.log_segment(), None, engine)? | ||
} else { | ||
HashMap::new() | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced we need to verify that the domain metadata exists. After all, Delta never checks that a removed file is actually present. Is this something that we need to do? If so, can you put a comment explaining why we need to read the entire log and set the removed configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a comment (that the Delta spec requires it). let's bottom out on this in one of the above threads.
// actual configuration value determined during commit | ||
self.domain_metadatas | ||
.push(DomainMetadata::remove(domain, String::new())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having the dummy data and checking below whether we have a removal, why not maintain a separate removed_domains: HashSet<String>
that we fill in after the log replay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: only do this if we actually do need to preserve the removed domain metadata's configuration. See my comment here: https://github.com/delta-io/delta-kernel-rs/pull/1275/files#r2400384116
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Delta spec requires it for configuration values. See here: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#:~:text=When%20true%2C%20the%20action%20serves%20as%20a%20tombstone%20to%20logically%20delete%20a%20metadata%20domain.%20Writers%20should%20preserve%20an%20accurate%20pre%2Dimage%20of%20the%20configuration..
See this thread for the prior discussion on this: #1275 (comment).
if dm.is_internal() { | ||
return Err(Error::Generic( | ||
"Cannot modify domains that start with 'delta.' as those are system controlled" | ||
.to_string(), | ||
)); | ||
} | ||
if !domains.insert(domain_metadata.domain()) { | ||
|
||
if !seen_domains.insert(dm.domain()) { | ||
return Err(Error::Generic(format!( | ||
"Metadata for domain {} already specified in this transaction", | ||
domain_metadata.domain() | ||
dm.domain() | ||
))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These feel like checks that should be done in with_domain_metadata
:
if dm.is_internal()
if !seen_domains.insert(dm.domain())
Generally I prefer to fail early. Feel free to make this a followup issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally agree with the fail fast principle, but this case warrants an exception imo.
We're treating these methods as builder methods (return Self
, even though there is no explicit build
call) so we should present a fluent builder API. If we do the checking of the error conditions at this point, then we need to return DeltaResult<Self>
, which is an awkward interface for both Rust (chaining requires try operator), but particularly for engines across the FFI boundary.
wdyt?
if dm.is_removed() { | ||
has_removals = true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linearly checking domain metadatas can be avoided by maintaining a separate hashmap. See here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's bottom out on deferred error checking or not above. Because if we choose to defer error checking, then we cannot make it a HashMap
. If we decide to not defer, then will refactor accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tysm for the review! wanted to align on 2 high-level things so added thoughts on.
- preserve pre-image of removed configuration
- error checking, fail-fast or in
commit
// actual configuration value determined during commit | ||
self.domain_metadatas | ||
.push(DomainMetadata::remove(domain, String::new())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Delta spec requires it for configuration values. See here: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#:~:text=When%20true%2C%20the%20action%20serves%20as%20a%20tombstone%20to%20logically%20delete%20a%20metadata%20domain.%20Writers%20should%20preserve%20an%20accurate%20pre%2Dimage%20of%20the%20configuration..
See this thread for the prior discussion on this: #1275 (comment).
if dm.is_internal() { | ||
return Err(Error::Generic( | ||
"Cannot modify domains that start with 'delta.' as those are system controlled" | ||
.to_string(), | ||
)); | ||
} | ||
if !domains.insert(domain_metadata.domain()) { | ||
|
||
if !seen_domains.insert(dm.domain()) { | ||
return Err(Error::Generic(format!( | ||
"Metadata for domain {} already specified in this transaction", | ||
domain_metadata.domain() | ||
dm.domain() | ||
))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally agree with the fail fast principle, but this case warrants an exception imo.
We're treating these methods as builder methods (return Self
, even though there is no explicit build
call) so we should present a fluent builder API. If we do the checking of the error conditions at this point, then we need to return DeltaResult<Self>
, which is an awkward interface for both Rust (chaining requires try operator), but particularly for engines across the FFI boundary.
wdyt?
if dm.is_removed() { | ||
has_removals = true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's bottom out on deferred error checking or not above. Because if we choose to defer error checking, then we cannot make it a HashMap
. If we decide to not defer, then will refactor accordingly.
// fetch previous configuration values (requires log replay) | ||
let existing_domains = if has_removals { | ||
scan_domain_metadatas(self.read_snapshot.log_segment(), None, engine)? | ||
} else { | ||
HashMap::new() | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a comment (that the Delta spec requires it). let's bottom out on this in one of the above threads.
What changes are proposed in this pull request?
This PR is (2/2) to support writing domain metadata. It adds support for removing metadata for user-specified domains. As per the Delta specification, for a removed domain metadata "writers should preserve an accurate pre-image of the configuration". Thus for any removals, we need to perform a log replay and restore the original configuration of the domain with the
removed
flag set totrue
. Furthermore, for any removal where the domain does not already exist in the log, we treat this as a no-op and do not write any record to the Delta Log.How was this change tested?
Added new integration tests in
write.rs
.